Abstract. Within the field of phylogenetics there is growing interest in measures for
summarising the dissimilarity, or incongruence, of two or more phylogenetic trees. Many of
these measures are NP-hard to compute and this has stimulated a considerable volume of
research into fixed parameter tractable algorithms. In this article we use Monadic Second
Order logic (MSOL) to give alternative, compact proofs of fixed parameter tractability for
several well-known incongruency measures. In doing so we wish to demonstrate the
considerable potential of MSOL - machinery still largely unknown outside the algorithmic graph
theory community - within phylogenetics. A crucial component of this work is the
observation that many of these measures, when bounded, imply the existence of an agreement forest
of bounded size, which in turn implies that an auxiliary graph structure, the display graph,
has bounded treewidth. It is this bound on treewidth that makes the machinery of MSOL
available for proving fixed parameter tractability. We give a variety of different MSOL
formulations. Some are based on explicitly encoding agreement forests, while some only use
them implicitly to generate the treewidth bound. Our formulations introduce a number of
“phylogenetics MSOL primitives” which will hopefully be of use to other researchers.


1 Introduction 

The central goal of phylogenetics is to accurately infer the evolutionary history of a set of species
(or taxa) X from incomplete information. Classically, phylogenetic reconstruction has access to
information about each element in X, such as DNA data, and seeks to infer a phylogenetic tree
- a tree whose leaves are bijectively labeled by X - that best fits this data. There is a vast
literature available on this topic and many different algorithms exist for constructing phylogenetic
trees [16126] . In practice, it is not uncommon for phylogenetic analysis to generate multiple
phylogenetic trees as output. This can occur for various reasons, ranging from software engineering
choices (many tree-building packages are designed to generate multiple optimal and near-optimal
solutions) to more structural explanations (reticulate evolutionary signals that are comprised of
multiple distinct tree signals). Given two (or more) distinct phylogenetic trees, it is natural to
compare them to determine whether the difference is significant. This explains the interest of
the phylogenetics community in measures that can quantify the dissimilarity, or incongruence, of
phylogenetic trees [2D]. Some of these measures (such as Tree Bisection and Reconnection
distance |T]) are studied to better understand how local-search heuristics, based on rearrangement
operations, navigate the space of phylogenetic trees (e.g., 0). Others, such as Hybridization
Number [8], are studied because they assist with the inference of phylogenetic networks, which
generalise phylogenetic trees to directed acyclic graphs [2DI2T1 .

Unfortunately, many of these measures are NP-hard and APX-hard to compute. On the
positive side, however, the phylogenetics community has been quite successful in proving that these
measures are fixed parameter tractable (FPT) in their natural parameterizations. Informally this
means that a measure that evaluates to k can be computed in time f(k) · poly(n), where f is some
function that only depends on k and n is the size of the instance (often taken to be |X|). Such
running times have the potential to be much faster than running times of the form O(n^{f(k)}) when
the measure in question is comparatively small. See e.g. m for more background on FPT. A 
number of state-of-the-art phylogenetics software packages are based on FPT algorithms, such as 
the software used in [28]. Most FPT results in the phylogenetics literature are based on classical 
proof techniques such as polynomial-time kernelization and bounded-search. 

Parallel to all of this, algorithmic graph theorists have made great steps forward in identifying
sufficient, structural conditions under which NP-hard problems on graphs become (fixed
parameter) tractable. At the heart of this research lies the width parameter, the most famous example
being treewidth. Informally, treewidth is a measure that quantifies the dissimilarity of a graph
from being a tree. The notion of treewidth, which is most famously associated with the celebrated
Graph Minors project of Robertson and Seymour [24], has had a profound impact upon algorithm
design. A great many NP-hard problems turn out to become tractable on graphs of bounded
treewidth, using broadly similar proof techniques, i.e. dynamic programming on tree
decompositions (3j. This contributed to the rise of meta-theorems, the archetypal example being Courcelle’s
Theorem [he]. This states, when combined with the result from [5], that any graph property
that can be abstractly formulated as a length ℓ sentence of Monadic Second Order logic (MSOL)
can be tested in time f(t, ℓ) · O(n) on graphs of treewidth t, where n is the number of vertices
in the graph. When t and ℓ are both bounded by a function of a single parameter p, this yields
a running time of the form f(p) · O(n), i.e. linear-time fixed parameter tractability in parameter
p. This is an extremely powerful technique in the sense that it completely abstracts away from
ad-hoc algorithm design and permits highly compact, “declarative” proofs that a problem is FPT.
Courcelle’s Theorem (and its variants) are more than two decades old, but their potential is rarely
exploited by the phylogenetics community. One exception is the literature on unrooted
compatibility, which asks whether a set of unrooted phylogenetic trees are compatible. The FPT proof by
Bryant and Lagergren [TO] proves that the display graph (the graph obtained by identifying all
taxa with the same label) has bounded treewidth (in the number of input trees), and then gives
an MSOL formulation which tests compatibility. A follow-up result by the present authors applies
a similar approach [25].

In this article we show that this technique has much broader potential within phylogenetics. To
clarify the exposition we focus on binary trees (both rooted and unrooted) on the same set of taxa
X. We begin by proving that if two trees have an agreement forest of size k - essentially a partition
of the trees into k isomorphic subtrees - the treewidth of the display graph is bounded by a function
of k. This simple observation is significant because of the prominent role of agreement forests
within the phylogenetics literature. We use this insight to re-analyse three well-known NP-hard
phylogenetics problems that were previously shown to be FPT using more conventional analysis.
In particular, we give MSOL formulations for (1) Unrooted Maximum Agreement Forest
(uMAF), which is equivalent to the problem of computing Tree Bisection and Reconnection
distance (TBR) on unrooted trees, (2) rooted Maximum Agreement Forest (rMAF), which
is equivalent to the problem of computing Rooted Subtree Prune and Regraft distance
(rSPR) on rooted trees, and (3) Hybridization Number (HN) on rooted trees. The formulations
for uMAF and rMAF are based on explicitly modelling agreement forests using quartets and edge
cuts. The formulation for HN uses agreement forests implicitly to obtain the treewidth bound
but, due to the difficulties in encoding acyclic agreement forests, then bypasses the agreement
forest abstraction. Instead, it encodes an equivalent, “elimination ordering” formulation of HN
which considers sequences of pruned common subtrees. Finally we consider the (4) Maximum
Parsimony Distance on Binary Characters problem. This asks for a binary character f on
X that maximizes the absolute difference between the parsimony score of f on the two trees. It
is NP-hard but not known to be FPT (in the parsimony distance). Here we give an optimization
MSOL formulation which shows that the problem is FPT in parameter uMAF. Although this does
not settle whether the natural parameterization of the problem is FPT, it does demonstrate a
number of interesting principles. Firstly, it demonstrates the power of “simulating” the execution
of polynomial-time algorithms (in this case, Fitch’s algorithm [18]) within MSOL. Secondly, any
subsequent proof that TBR distance is at most a bounded distance above d^2_MP distance and/or
that d^2_MP distance induces bounded treewidth display graphs, will automatically prove that d^2_MP
distance is FPT in its natural parameterization.

Summarizing, our formulations show the potential for MSOL to generate compact, logical FPT 
proofs for phylogenetics problems. The machinery of MSOL does not yield practical algorithms 
but it is an excellent classification tool. Once the existence of FPT algorithms has been confirmed 
via MSOL one can then switch efforts to finding a good FPT algorithm by more direct analysis, 
possibly (but not exclusively) through direct analysis of tree decompositions. Our formulations 
also introduce a number of phylogenetics “primitives” concerning quartets, clusters, subtrees and 
compatibility that we hope will be of use to other phylogenetics researchers. 

2 Preliminaries 

In this section, we define the main objects that will be manipulated in this paper. 

An unrooted phylogenetic tree T (unrooted tree for short) is a tree in which no vertex has
degree 2 and in which the leaves are bijectively labeled by a label set L(T). The leaf labels are
often called taxa and the symbol X is frequently used as shorthand for L(T). Internal vertices are
not labeled. A rooted phylogenetic tree (rooted tree for short) is defined similarly, except that it
has exactly one vertex, called the root of the tree, that is permitted to have degree 2, and edges
are directed away from the root. An unrooted tree is binary if every internal vertex has degree 3,
and a rooted tree is binary if each internal vertex has indegree 1 and outdegree 2, and the root
has outdegree 2 and indegree 0.

Given an unrooted tree T and a subset Y ⊆ L(T) of its labels, we use T(Y) to denote the minimal
subtree of T connecting Y. Moreover, we denote by T|Y the tree obtained from T(Y) when suppressing
vertices of degree 2. We say that T|Y is the subtree of T induced by Y. In graph theory terms,
T|Y is a label-preserving topological minor of T. Induced subtrees are defined in the same way for
rooted trees, except that the root of T|Y becomes the vertex in the minimal connecting subgraph
that is closest to the root of T, and we suppress all degree 2 vertices except the new root. We
write T − Y to denote T|(L(T) \ Y). For any node u of a rooted tree T, T_u is the subtree of T rooted
at u.



Fig. 1. (a) Two unrooted binary phylogenetic trees on {u,v,w,y}. A maximum agreement forest (uMAF)
for these two trees contains 2 components, and can be obtained by cutting the single red edge in both trees
and then suppressing the resulting degree 2 vertices. (b) The display graph for the two trees from (a),
obtained by identifying leaves with the same label. (c) How addition of a special label p can be used to root
a tree: edges are assumed to be directed away from p.


Given a label set X, a bipartition (or split) A|B on X is a partition of X into two non-empty
sets. Each edge {u, v} of a tree T induces a split L(T_u)|L(T_v) of its label set, where T_u and T_v
are the two trees obtained from T when {u, v} is deleted. Given a rooted tree T with label set X, a
subset X' of X is called a clade (or cluster) of T, if T contains a node v such that L(T_v) = X'.

Given an unrooted binary tree T and a set of four distinct labels {u, v, w, y} in L(T), T|{u,v,w,y}
will be exactly one of the three possible unrooted binary trees on {u,v,w,y}. These are called
quartets and are denoted respectively by uv|wy, uw|vy and wv|uy, depending on the bipartition
induced by its central edge. In Figure 1(a) we see uv|wy and uw|vy. Given a rooted binary tree
T and a set of three labels {u,v,w} in L(T), T|{u,v,w} will be exactly one of the three possible
rooted binary trees on {u,v,w}. These are called triplets and are denoted respectively by uv|w,
uw|v and wv|u, where ij|k means that the leaf labelled k is incident to the root.
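As an illustration of the quartet definition, the induced quartet of a binary tree can be read off from path disjointness: the topology on four leaves is the pairing whose two connecting paths share no vertex. The sketch below is only an illustrative aid, not part of the formal development; the adjacency-dictionary representation, the function names and the internal vertex names "a" and "b" are assumptions made for this example.

```python
from collections import deque


def path_vertices(adj, start, goal):
    """Vertices on the unique start-goal path of a tree given as an adjacency dict."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path = set()
    node = goal
    while node is not None:
        path.add(node)
        node = parent[node]
    return path


def quartet(adj, a, b, c, d):
    """Return the induced quartet as a set of two leaf pairs, e.g. {{u, v}, {w, y}} for uv|wy."""
    pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
    for (p, q), (r, s) in pairings:
        # In a binary tree exactly one pairing has vertex-disjoint connecting paths.
        if path_vertices(adj, p, q).isdisjoint(path_vertices(adj, r, s)):
            return frozenset({frozenset({p, q}), frozenset({r, s})})
    raise ValueError("no pairing has disjoint paths; is the tree binary?")


# Hypothetical tree with quartet topology uv|wy; "a" and "b" are internal vertices.
tree = {
    "u": ["a"], "v": ["a"], "a": ["u", "v", "b"],
    "b": ["a", "w", "y"], "w": ["b"], "y": ["b"],
}
topology = quartet(tree, "u", "w", "v", "y")
```

The result does not depend on the order in which the four labels are passed, since all three pairings are tried.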

Let 𝒯 = {T_1, T_2, ..., T_k} be a collection of unrooted trees, not necessarily on the same set
of taxa. The display graph of 𝒯 is obtained from the disjoint graph union of all trees in 𝒯 by
identifying vertices with the same label; see Figure 1(b).

Given an undirected graph G = (V, E), a bag is simply a subset of V. A tree decomposition
of G consists of a tree T_G = (V(T_G), E(T_G)) where V(T_G) is a collection of bags such that the
following holds: (1) every vertex of V is in at least one bag; (2) for each edge {u,v} ∈ E, there
exists some bag that contains both u and v; (3) for each vertex u ∈ V, the bags that contain u
induce a connected subtree of T_G. The width of a tree decomposition is equal to the cardinality of
its largest bag, minus 1. The treewidth of a graph G is equal to the minimum width, ranging over
all possible tree decompositions of G. A tree with at least one edge has treewidth 1. For a fixed
value of k one can determine in linear time whether a graph has treewidth at most k [S| .
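The three conditions in the definition of a tree decomposition can be checked mechanically. The following is a minimal sketch under an assumed representation (bags as a dictionary from bag identifiers to vertex sets, and the decomposition tree as a list of pairs of bag identifiers); the function names are hypothetical and chosen only for this example.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """bags: dict bag_id -> set of graph vertices; tree_edges: pairs of bag ids."""
    neighbours = {b: set() for b in bags}
    for a, b in tree_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def connected(bag_ids):
        """True iff bag_ids is non-empty and induces a connected subgraph of the bag tree."""
        bag_ids = set(bag_ids)
        if not bag_ids:
            return False
        start = next(iter(bag_ids))
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbours[stack.pop()] & bag_ids:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen == bag_ids

    # The bags themselves must form a tree: connected, with |bags| - 1 edges.
    if len(tree_edges) != len(bags) - 1 or not connected(bags):
        return False
    # Conditions (1) and (3): each vertex lies in a non-empty, connected set of bags.
    for v in vertices:
        if not connected(b for b, content in bags.items() if v in content):
            return False
    # Condition (2): every edge is contained in some bag.
    return all(any(u in c and v in c for c in bags.values()) for u, v in edges)


def width(bags):
    """Cardinality of the largest bag, minus 1."""
    return max(len(content) for content in bags.values()) - 1


# A path a-b-c and a width-1 decomposition of it.
path_vertices_abc = ["a", "b", "c"]
path_edges = [("a", "b"), ("b", "c")]
good_bags = {0: {"a", "b"}, 1: {"b", "c"}}
```

A decomposition such as bags {a,b}, {b}, {a,c} arranged in a path is rejected, because the bags containing a are not connected and the edge {b,c} is not covered.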


3 Main results 

Unless stated otherwise, we assume that T_1 = (V_1, E_1) and T_2 = (V_2, E_2) are both unrooted
binary trees on X. Their display graph is denoted by D = (V, E) and R^D denotes the vertex-edge
incidence relation in D. We use adj to denote the vertex-vertex adjacency relation in D. Note that
|V| = 3|X| − 4 and |E| = 4|X| − 6.
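The vertex and edge counts follow because each binary tree on |X| leaves has 2|X| − 2 vertices and 2|X| − 3 edges, and the |X| leaves are shared. The sketch below builds the display graph of two trees and lets one check the counts on small cases; the edge-list representation and the helper `caterpillar` (which generates hypothetical test trees with internal vertices named i0, i1, ...) are assumptions made for illustration.

```python
def display_graph(trees, taxa):
    """trees: list of edge lists; internal vertex names are local to each tree."""
    taxa = set(taxa)
    vertices, edges = set(), set()
    for index, tree_edges in enumerate(trees):
        for u, v in tree_edges:
            # Taxa are shared between trees; internal vertices are tagged by tree index.
            a = u if u in taxa else (index, u)
            b = v if v in taxa else (index, v)
            vertices.update((a, b))
            edges.add(frozenset((a, b)))
    return vertices, edges


def caterpillar(taxa):
    """Edge list of an unrooted binary caterpillar tree on at least 3 taxa, in the given order."""
    n = len(taxa)
    internal = [f"i{j}" for j in range(n - 2)]
    edges = [(internal[j], internal[j + 1]) for j in range(n - 3)]
    edges += [(taxa[0], internal[0]), (taxa[-1], internal[-1])]
    edges += [(taxa[j + 1], internal[j]) for j in range(n - 2)]
    return edges


taxa = list("abcde")
V, E = display_graph([caterpillar(taxa), caterpillar(taxa[2:] + taxa[:2])], taxa)
```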


3.1 TBR / MAF on unrooted trees 

We will start by giving the definitions of a TBR move and of the TBR distance between two 
unrooted binary trees. 

Definition 1 (TBR move). Given an unrooted binary tree T, a tree bisection and reconnection
(TBR) move on T consists of removing an edge of T, say {u, v}, and then reconnecting the subtrees
T_u and T_v as follows: subdividing an edge of T_u with a new vertex p; subdividing an edge of T_v
with a new vertex q; connecting p to q; and finally suppressing any vertices of degree 2.

TBR distance is then defined naturally as follows:

Problem: d_TBR(T_1, T_2)

Input: Two unrooted binary trees T_1, T_2 on the same set of taxa X.

Output: The minimum number of TBR moves required to transform T_1 into T_2.

We will now give the definition of an uMAF for two unrooted binary trees T_1, T_2 on X. Any
collection of trees whose label sets partition X is said to be a forest on X. Furthermore, we say
that a set F = {F_1, ..., F_k} of unrooted binary phylogenetic trees - with |F| referred to as the
size of F - is a forest for T if F can be obtained from T by deleting a (k − 1)-sized subset E of
E(T), suppressing any unlabeled leaves, and then finally suppressing any vertices with degree 2.
To ease reading, we write F = T − E if F can be obtained in this way.

Definition 2 (uMAF). A set F of unrooted trees is an agreement forest for T_1 and T_2 (denoted
uAF) if F is a forest of both T_1 and T_2. An unrooted maximum agreement forest (uMAF) is an
uAF of minimum size.


So, the uMAF problem is defined as follows: 


Problem: uMAF(T_1, T_2)

Input: Two unrooted binary trees T_1, T_2 on the same set of taxa X.

Output: An uMAF for T_1 and T_2.

The two problems defined above are closely related, and known to be NP-hard [T]. 

Theorem 1 ([!]). Given two unrooted binary trees T_1, T_2 on the same set of taxa X, we have
that d_TBR(T_1, T_2) = |uMAF(T_1, T_2)| − 1.

Fortunately, they have been proved to be FPT in their natural parameters pQ, and fast
algorithms have been recently proposed 1271131 . In this section, we will give a more compact proof of
their fixed parameter tractability.

Theorem 2. Let T_1, T_2 be two unrooted binary trees on the same set of taxa X such that a uAF
of size k for these two trees exists. Then, the treewidth of their display graph D is at most k + 1.

Proof. From |TS], we know that the display graph of two identical trees has treewidth 2 (or 1 in the
case that both trees consist of a single vertex). Thus, if we have an uAF F = {F_1, ..., F_k} of size k,
this means that the display graph D_F of F (which we define as the display graph constructed from
two disjoint copies of F) has k connected components, and treewidth at most 2. This is because
the treewidth of a disconnected graph is equal to the largest treewidth ranging over its connected
components. Now, we can construct a tree decomposition of D from the tree decomposition of D_F
as follows: suppose F can be obtained by removing from T_1, respectively T_2, a subset of edges
K_1, respectively K_2, and suppressing vertices with degree 2 and unlabeled leaves. First, note that
we can reintroduce the suppressed vertices (and their corresponding edges) in F, obtaining a new
forest F', without changing the treewidth. Indeed, given an edge {u, v} in F that corresponded to
a path (u, x_1, ..., x_j, v) before the suppression of the vertices with degree 2, we know that there
exists a bag B in the tree decomposition of D_F such that u and v are in B. Then we can add a set
of bags {B_1, ..., B_j} such that B_1 = {u, x_1, v}, B_2 = {x_1, x_2, v}, ..., B_j = {x_{j−1}, x_j, v}, and add
edges {B, B_1}, {B_1, B_2}, ..., {B_{j−1}, B_j} to the tree decomposition. For the suppressed unlabeled
leaves, say u, this is even easier: we add a bag {u, v} as child of any of the bags containing v, where
v is the vertex from which the suppressed leaf was hanging. It is easy to see that this is a tree
decomposition of the display graph of F' with width 2. Now, we can easily reintroduce the k − 1
edges in K_1 to the display graph, again without changing the width, by, for each edge {u, v}
in K_1, adding a bag {u, v} between two existing bags, one containing u and the other containing
v. Note that the obtained decomposition is still a tree, since we are connecting two components
of F'. Now, when adding back the edges of K_2, this is not true anymore. In this case, there exists
at least one path in the tree decomposition connecting a bag containing u to a bag containing v.
Then, taking the shortest of these paths and adding u to its bags not containing u, we increase
the width by at most 1. If we do this for all edges in K_2, we obtain a tree decomposition for
the display graph of T_1 and T_2 with width at most 2 + (k − 1) = k + 1. Note that this bound is
tight, as the following example shows: an uMAF of two quartets with different topologies, uv|wx
and ux|vw say, contains 2 components, and the display graph of these two quartets has treewidth
3 (see also ,12). □
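The tightness example can be checked by brute force on such a small graph. The sketch below computes exact treewidth via the standard characterisation through elimination orderings (the treewidth is the minimum, over all vertex orderings, of the maximum degree of a vertex at the moment it is eliminated, where elimination turns the neighbourhood into a clique). This is not the method used in the proof and is exponential in the number of vertices; the function name and the internal vertex names a, b, c, d are assumptions made for this example.

```python
from functools import lru_cache


def treewidth(vertices, edges):
    """Exact treewidth by dynamic programming over sets of eliminated vertices."""
    vertices = frozenset(vertices)
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def degree_after(eliminated, v):
        """Degree of v once `eliminated` has been eliminated: the number of remaining
        vertices reachable from v by a path whose internal vertices are all eliminated."""
        seen, stack, count = {v}, [v], 0
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt in seen:
                    continue
                seen.add(nxt)
                if nxt in eliminated:
                    stack.append(nxt)
                else:
                    count += 1
        return count

    @lru_cache(maxsize=None)
    def best(eliminated):
        remaining = vertices - eliminated
        if not remaining:
            return -1
        return min(
            max(degree_after(eliminated, v), best(eliminated | {v}))
            for v in remaining
        )

    return best(frozenset())


# Quartet uv|wx (internal vertices a, b) and quartet ux|vw (internal vertices c, d).
q1 = [("u", "a"), ("v", "a"), ("a", "b"), ("w", "b"), ("x", "b")]
q2 = [("u", "c"), ("x", "c"), ("c", "d"), ("v", "d"), ("w", "d")]
# A second copy of uv|wx on internal vertices c, d, for comparison.
q1_copy = [("u", "c"), ("v", "c"), ("c", "d"), ("w", "d"), ("x", "d")]
all_vertices = set("uvwxabcd")
```

Here `treewidth(all_vertices, q1 + q2)` evaluates to 3: suppressing the four degree-2 leaves of that display graph leaves a complete graph on a, b, c, d. For two copies of the same quartet the value is 2.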

In the following, we will demonstrate that |uMAF(T_1, T_2)| =: k can be computed in time
O(f(k) · |X|) for some computable function f that depends only on k. We do this via the machinery
of MSOL. The high-level idea is that we formulate a logical query to answer the question “Is
k ≤ k'?” for increasing values of k' until the answer is yes, and then stop: at this point k' = k.
We use the stronger variant of MSOL that allows quantification over both edges and vertices. In
particular, we will use the extended MSOL framework of Arnborg et al. [2]. Following H0l25| we
note that the sets V_1, E_1, V_2, E_2, X (and later, p) are all available to the MSOL query, i.e. within
the query we can distinguish which vertices/edges of D belong to T_1, which belong to T_2, and
which are taxa.






More formally, we construct an MSOL formula Φ(K_1, K_2) and a relational structure G such
that G ⊨ Φ(K_1, K_2) if and only if K_1 is a set of k' − 1 edges of E_1, and K_2 is a set of k' − 1 edges
of E_2, such that, after deleting them, the resulting components form an uAF F for both T_1 and
T_2, i.e. F_1 = T_1 − K_1 = F_2 = T_2 − K_2. To model this, we need to have that: (1) the two forests
F_1 and F_2 induce an identical partition of X and (2) the components of the two induced forests
must have the same topology. To enforce (1) we observe that (in, say, T_1) two taxa x_1 and x_2
are in the same component of the forest resulting from deletion of K_1 if and only if they can still
reach each other inside T_1 after deletion of those edges. In turn, this occurs if and only if there is
a path from x_1 to x_2 entirely contained inside T_1 which avoids all the edges in K_1. To enforce (2)
we demand that a quartet is in the first forest (i.e. the quartet is contained inside one of the trees
in the forest) if and only if the quartet is in the second forest. This uses the fact that two unrooted
binary trees on the same set of taxa are topologically identical if and only if they induce identical
sets of quartets [12] .

Before defining Φ(K_1, K_2), we need to introduce several intermediate predicates. These build
on a number of basic predicates which we mainly list for the benefit of readers not familiar with
MSOL. They are used to:


- test that Z is equal to the union of two sets P and Q:

\[ P \cup Q = Z \;:=\; \forall z\,(z \in Z \Rightarrow z \in P \lor z \in Q) \land \forall z\,(z \in P \Rightarrow z \in Z) \land \forall z\,(z \in Q \Rightarrow z \in Z) \]

- test that P ∩ Q = ∅:

\[ \mathit{NoIntersect}(P, Q) \;:=\; \forall u \in P\,(u \notin Q) \]

- test that P ∩ Q = {v}:

\[ \mathit{Intersect}(P, Q, v) \;:=\; (v \in P) \land (v \in Q) \land \forall u \in P\,(u \in Q \Rightarrow (u = v)) \]

- test if the sets P and Q are a bipartition of Z:

\[ \mathit{Bipartition}(Z, P, Q) \;:=\; (P \cup Q = Z) \land \mathit{NoIntersect}(P, Q) \]

- test if the elements in {x_1, x_2, x_3, x_4} are pairwise different:

\[ \mathit{allDiff}(x_1, x_2, x_3, x_4) \;:=\; \bigwedge_{i \neq j \in \{1,2,3,4\}} x_i \neq x_j \]

- check if the nodes p and q are adjacent in D:

\[ \mathit{adj}(p, q) \;:=\; \exists e \in E\,(R^D(e, p) \land R^D(e, q)) \]

The predicate PAC(Z, x_1, x_2, K_i) (“path avoids cuts?”) asks: is there a path from x_1 to x_2
entirely contained inside vertices Z that avoids all the edges K_i? We model this by observing that
this fails exactly when Z can be partitioned into two pieces P and Q, with x_1 ∈ P and x_2 ∈ Q, such
that the only edges that cross the induced cut (if any) are in K_i:

\[
\begin{aligned}
\mathit{PAC}(Z, x_1, x_2, K_i) \;:=\; (x_1 = x_2) \lor \neg\exists P, Q\,\Big(&\mathit{Bipartition}(Z, P, Q) \land x_1 \in P \land x_2 \in Q \;\land \\
&\big(\forall p, q\,(p \in P \land q \in Q \Rightarrow \neg\mathit{adj}(p, q) \lor (\exists g \in K_i\,(R^D(g, p) \land R^D(g, q))))\big)\Big)
\end{aligned}
\]
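Outside of MSOL the same question is an ordinary reachability test, which may help in reading the predicate. The following is a minimal sketch, assuming an edge-list representation of the graph; the function name and argument names are hypothetical. One deliberate simplification: endpoints lying outside the allowed vertex set are treated as unreachable, a case in which the logical formula should not be applied.

```python
from collections import deque


def path_avoids_cuts(edges, allowed, x1, x2, cuts):
    """True iff x1 and x2 are joined by a path inside `allowed` that uses no edge in `cuts`."""
    if x1 == x2:
        return True
    allowed = set(allowed)
    cuts = {frozenset(e) for e in cuts}
    adj = {v: set() for v in allowed}
    for u, v in edges:
        # Keep only edges with both endpoints allowed and not belonging to the cut set.
        if u in allowed and v in allowed and frozenset((u, v)) not in cuts:
            adj[u].add(v)
            adj[v].add(u)
    if x1 not in adj or x2 not in adj:
        return False
    seen, queue = {x1}, deque([x1])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt == x2:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical quartet tree uv|wx with internal vertices a and b.
tree_edges = [("u", "a"), ("v", "a"), ("a", "b"), ("w", "b"), ("x", "b")]
everything = set("uvwxab")
```

Cutting the central edge {a, b} separates u from x but leaves u and v connected, which is the behaviour the bipartition formulation captures.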

We model that a quartet is in the forest (of, say, T_1) by stipulating that there is an embedding
(i.e. subdivision) of the quartet, completely contained inside T_1, which avoids all the edges in
K_1. To model the embedding, we model the five edges of the quartet as five subsets of vertices
A, B, C, D, P, representing the subdivisions of the five edges, with P being the central edge and
u and v being its endpoints. We demand that (with the exception of u and v) these subsets are
disjoint. This is all combined in the following QAC^1 predicate (“quartet avoids cuts in T_1?”),
which returns true if and only if T_1 contains an embedding of x_a x_b|x_c x_d that is disjoint from the
edge cuts K_1.

\[
\begin{aligned}
\mathit{QAC}^1(x_a, x_b, x_c, x_d, K_1) \;:=\; \exists u, v \in V_1\,\Big(&(u \neq v) \land \exists A, B, C, D, P \subseteq V_1\,\big(x_a, u \in A \land x_b, u \in B \land x_c, v \in C \;\land \\
& x_d, v \in D \land u \in P \land v \in P \land \mathit{Intersect}(A, B, u) \;\land \\
& \mathit{Intersect}(A, P, u) \land \mathit{Intersect}(B, P, u) \land \mathit{Intersect}(C, D, v) \;\land \\
& \mathit{Intersect}(C, P, v) \land \mathit{Intersect}(D, P, v) \land \mathit{NoIntersect}(A, C) \;\land \\
& \mathit{NoIntersect}(B, C) \land \mathit{NoIntersect}(A, D) \land \mathit{NoIntersect}(B, D) \;\land \\
& \mathit{PAC}(A, u, x_a, K_1) \land \mathit{PAC}(B, u, x_b, K_1) \land \mathit{PAC}(C, v, x_c, K_1) \;\land \\
& \mathit{PAC}(D, v, x_d, K_1) \land \mathit{PAC}(P, u, v, K_1)\big)\Big)
\end{aligned}
\]

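We can define QAC^2(x_a, x_b, x_c, x_d, K_2) in a similar way. Note that, for every four taxa, we need
to consider all three possible quartet topologies. Then we define Φ(K_1, K_2) as follows:

\[
\begin{aligned}
\Phi(K_1, K_2) \;:=\; &\Big(\bigwedge_{i \in \{1,2\}} |K_i| = k' - 1\Big) \land \Big(\bigwedge_{i \in \{1,2\}} K_i \subseteq E_i\Big) \;\land \\
&\forall x_1, x_2 \in X\,\big(\mathit{PAC}(V_1, x_1, x_2, K_1) \Leftrightarrow \mathit{PAC}(V_2, x_1, x_2, K_2)\big) \;\land \\
&\forall x_1, x_2, x_3, x_4 \in X\,\Big(\mathit{allDiff}(x_1, x_2, x_3, x_4) \Rightarrow \\
&\quad \big((\mathit{QAC}^1(x_1, x_2, x_3, x_4, K_1) \Leftrightarrow \mathit{QAC}^2(x_1, x_2, x_3, x_4, K_2)) \;\land \\
&\quad\;\; (\mathit{QAC}^1(x_1, x_3, x_2, x_4, K_1) \Leftrightarrow \mathit{QAC}^2(x_1, x_3, x_2, x_4, K_2)) \;\land \\
&\quad\;\; (\mathit{QAC}^1(x_1, x_4, x_2, x_3, K_1) \Leftrightarrow \mathit{QAC}^2(x_1, x_4, x_2, x_3, K_2))\big)\Big).
\end{aligned}
\]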


(The cardinality operator is permitted because the extended MSOL framework of [2] allows the
incorporation of an evaluation relation which can test, amongst other things, the cardinalities of 
free set variables). 
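To make the semantics of the two conditions concrete, the sketch below evaluates them by exhaustive search on very small trees. It is emphatically not the MSOL machinery and has exponential running time; it only mirrors what the formula asks for: edge cuts of equal size in both trees that induce the same partition of the taxa and the same quartets inside every component. It uses the observation that four taxa lying in one component induce the same quartet there as in the uncut tree. The edge-list representation and all function names are assumptions made for this example, and taxa are assumed to be strings.

```python
from collections import deque
from itertools import combinations


def adjacency(edges, cuts=()):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if (u, v) not in cuts:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def bfs_parents(adj, start):
    parent, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def taxa_partition(edges, cuts, taxa):
    """Partition of the taxa induced by the components that remain after the cuts."""
    taxa = set(taxa)
    adj = adjacency(edges, cuts)
    return frozenset(frozenset(set(bfs_parents(adj, x)) & taxa) for x in taxa)


def quartet(edges, a, b, c, d):
    """Topology on {a, b, c, d}: the pairing whose connecting paths are vertex-disjoint."""
    adj = adjacency(edges)

    def path(s, t):
        parent, out = bfs_parents(adj, s), set()
        while t is not None:
            out.add(t)
            t = parent[t]
        return out

    for (p, q), (r, s) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        if path(p, q).isdisjoint(path(r, s)):
            return frozenset({frozenset({p, q}), frozenset({r, s})})
    return None


def is_agreement(edges1, cuts1, edges2, cuts2, taxa):
    blocks = taxa_partition(edges1, cuts1, taxa)
    if blocks != taxa_partition(edges2, cuts2, taxa):  # condition (1)
        return False
    return all(  # condition (2), checked inside each component
        quartet(edges1, *four) == quartet(edges2, *four)
        for block in blocks
        for four in combinations(sorted(block), 4)
    )


def umaf_size(edges1, edges2, taxa):
    """Smallest k' for which cuts of size k' - 1 in both trees give an agreement forest."""
    for k in range(1, len(taxa) + 1):
        for cuts1 in combinations(edges1, k - 1):
            for cuts2 in combinations(edges2, k - 1):
                if is_agreement(edges1, cuts1, edges2, cuts2, taxa):
                    return k
    return None


# Quartets uv|wx and ux|vw; internal vertex names are hypothetical.
t1 = [("u", "a"), ("v", "a"), ("a", "b"), ("w", "b"), ("x", "b")]
t2 = [("u", "c"), ("x", "c"), ("c", "d"), ("v", "d"), ("w", "d")]
```

For these two quartets `umaf_size(t1, t2, "uvwx")` is 2, in line with Theorem 1: a single TBR move transforms one quartet into the other.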

Theorem 3. Computation of TBR / uMAF on two unrooted binary trees on the same set of taxa
X is linear time FPT. That is, the optimum k can be computed in time O(f(k) · |X|) for some
computable function f that only depends on k.

Proof. We have presented a logical query to answer the question “Is k ≤ k'?” for increasing values
of k'. For each value of k' the MSOL query, which examines the display graph D, has fixed length.
Combining this with the fact that the treewidth of D is bounded by a function of k (by Theorem
2), and that the size of D is a linear function of |X|, we have the desired result. (Note that the
actual edge cuts - which can be used to construct a uMAF - can also be obtained in the same time 
bound by leveraging Theorem 4 of mao □ 

3.2 rSPR / MAF on rooted trees 

In this section, we will give a compact proof that the computation of rSPR distance is FPT in its 
natural parameter. Before that, we need to introduce some definitions. 

Definition 3 (rSPR move). Given a rooted binary tree T, a rooted subtree prune and regraft (rSPR)
move on T consists of removing an edge of T, say (u, v), yielding two trees T_u and T_v, and then
reconnecting them as follows: subdividing some edge of T_u with a new vertex p; adding an edge
directed from p to v, and then suppressing any vertices with indegree and outdegree both equal to 1.

rSPR distance is defined analogously to TBR distance, and a MAF for two rooted binary trees
T_1, T_2 is defined similarly to a uMAF, but in a rooted framework. We refer to [6] for precise
definitions. The main difference is that a forest consists of rooted binary trees and this has to be
taken into account when comparing the topology of the components. In the rooted context MAFs
are mainly studied because of their close relationship to rSPR distance. To accurately model rSPR
distance it is necessary to slightly modify each input tree T_i as follows: we add a vertex with special
label p at the end of a pendant edge adjoined to the original root of T_i, see Figure 1(c). We then
consider p to be part of the label set of the tree. Note that the addition of p means that we can
equivalently view each T_i as an unrooted binary tree, with p acting as a placeholder for the root
location, and this is how the trees will be modelled in the display graph.

The close relationship between MAF (assuming p has been added as described) and rSPR 
distance is summarized by the following well-known result. 

Theorem 4 ([6]). Given two rooted binary trees T_1, T_2 on the same set of taxa X, we have that
d_rSPR(T_1, T_2) = |MAF(T_1, T_2)| − 1.


Note that these problems have been proved NP-hard and FPT in their natural parameter [6].

The MSOL formulation is similar to the TBR formulation, but with the following changes.
When checking that the components induced by the edge cuts partition the taxa in the same way
in both T_1 and T_2 (i.e. by considering pairs of taxa that still have a path between them), we need
to range over X ∪ {p} instead of just X. More significantly, we need predicates for triplets instead
of quartets, because we are working in the rooted environment and two rooted binary trees are
topologically equivalent if and only if they contain the same set of triplets Q33. Fortunately we can
use the fact that triplet xy|z is in T_i (x, y, z ∈ X) if and only if quartet xy|pz is in the unrooted
interpretation of T_i.

However we cannot simply use p as the fourth parameter x_d to QAC^1 because this will evaluate
to false if the path from p to the rest of the quartet embedding has been cut. This is not what we
need: p is in this context only there to indicate direction, so its particular arm of the quartet
embedding can be cut without consequence. We can remedy this by introducing predicates Quartet^i
and Triplet^i which check whether the corresponding quartet/triplet was in the original tree (i.e.
before the edge cuts). We can then leverage the fact that, if three distinct taxa x, y, z are in the
same component of the forest, the unique triplet topology they induce within the component will
be the same topology as they induced in the original tree.

We first need the following predicate, which is a specialization of the earlier PAC predicate.
It tests whether there is a path from x_1 to x_2 that is entirely contained inside vertex set Z:

\[
\begin{aligned}
\mathit{path}(Z, x_1, x_2) \;:=\; (x_1 = x_2) \lor \neg\exists P, Q\,\big(&\mathit{Bipartition}(Z, P, Q) \land x_1 \in P \land x_2 \in Q \\
&\land (\forall p, q\,(p \in P \land q \in Q \Rightarrow \neg\mathit{adj}(p, q)))\big)
\end{aligned}
\]

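For each tree T_i, the following predicate checks whether the quartet x_a x_b|x_c x_d is contained in
T_i.

\[
\begin{aligned}
\mathit{Quartet}^i(x_a, x_b, x_c, x_d) \;:=\; \exists u, v \in V_i\,\Big(&(u \neq v) \land \exists A, B, C, D, P \subseteq V_i\,\big(x_a, u \in A \;\land \\
& x_b, u \in B \land x_c, v \in C \land x_d, v \in D \land u \in P \land v \in P \land \mathit{Intersect}(A, B, u) \;\land \\
& \mathit{Intersect}(A, P, u) \land \mathit{Intersect}(B, P, u) \land \mathit{Intersect}(C, D, v) \;\land \\
& \mathit{Intersect}(C, P, v) \land \mathit{Intersect}(D, P, v) \land \mathit{NoIntersect}(A, C) \;\land \\
& \mathit{NoIntersect}(B, C) \land \mathit{NoIntersect}(A, D) \land \mathit{NoIntersect}(B, D) \;\land \\
& \mathit{path}(A, u, x_a) \land \mathit{path}(B, u, x_b) \land \mathit{path}(C, v, x_c) \land \mathit{path}(D, v, x_d) \;\land \\
& \mathit{path}(P, u, v)\big)\Big)
\end{aligned}
\]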

For each rooted tree T_i, the following predicate checks whether the triplet x_a x_b|x_c is contained
in T_i (simply by checking whether x_a x_b|x_c p is contained in it):

\[ \mathit{Triplet}^i(x_a, x_b, x_c) \;:=\; \mathit{Quartet}^i(x_a, x_b, x_c, p) \]


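Now, we are ready to define TAC^1 (“triplet avoids cuts in T_1?”), which models whether a triplet
is in the forest of T_1 induced by the edge cuts:

\[
\begin{aligned}
\mathit{TAC}^1(x_a, x_b, x_c, K_1) \;:=\; &\mathit{Triplet}^1(x_a, x_b, x_c) \land \exists u, v \in V_1\,\Big((u \neq v) \land \exists A, B, C, D, P \subseteq V_1\,\big(x_a, u \in A \;\land \\
& x_b, u \in B \land x_c, v \in C \land p, v \in D \land u \in P \land v \in P \land \mathit{Intersect}(A, B, u) \;\land \\
& \mathit{Intersect}(A, P, u) \land \mathit{Intersect}(B, P, u) \land \mathit{Intersect}(C, D, v) \;\land \\
& \mathit{Intersect}(C, P, v) \land \mathit{Intersect}(D, P, v) \land \mathit{NoIntersect}(A, C) \;\land \\
& \mathit{NoIntersect}(B, C) \land \mathit{NoIntersect}(A, D) \land \mathit{NoIntersect}(B, D) \\
& \land\; \mathit{PAC}(A, u, x_a, K_1) \land \mathit{PAC}(B, u, x_b, K_1) \land \mathit{PAC}(C, v, x_c, K_1) \;\land \\
& \mathit{path}(D, v, p) \land \mathit{PAC}(P, u, v, K_1)\big)\Big)
\end{aligned}
\]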

Note how we use path rather than PAC to model the path from v to p, because it does not
matter for the triplet whether this path is cut. The final MSOL formulation is then very similar
to that given in Section 3.1:


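\[
\begin{aligned}
&\Big(\bigwedge_{i \in \{1,2\}} |K_i| = k' - 1\Big) \land \Big(\bigwedge_{i \in \{1,2\}} K_i \subseteq E_i\Big) \land \forall x_1, x_2 \in X \cup \{p\}\,\big(\mathit{PAC}(V_1, x_1, x_2, K_1) \Leftrightarrow \\
&\mathit{PAC}(V_2, x_1, x_2, K_2)\big) \land \forall x_1, x_2, x_3 \in X\,\Big(\mathit{allDiff}(x_1, x_2, x_3) \Rightarrow \\
&\quad \big((\mathit{TAC}^1(x_1, x_2, x_3, K_1) \Leftrightarrow \mathit{TAC}^2(x_1, x_2, x_3, K_2)) \;\land \\
&\quad\;\; (\mathit{TAC}^1(x_1, x_3, x_2, K_1) \Leftrightarrow \mathit{TAC}^2(x_1, x_3, x_2, K_2)) \;\land \\
&\quad\;\; (\mathit{TAC}^1(x_2, x_3, x_1, K_1) \Leftrightarrow \mathit{TAC}^2(x_2, x_3, x_1, K_2))\big)\Big).
\end{aligned}
\]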

Theorem 5. Computation of rSPR / MAF on two rooted binary trees on the same set of taxa
X is linear time FPT. That is, the optimum k can be computed in time O(f(k) · |X|), for some
computable function f that only depends on k.

Proof. An agreement forest of the two rooted trees T_1 and T_2 induces an agreement forest
(consisting of unrooted binary trees) of the same size of the unrooted interpretations of these trees,
simply by ignoring the orientation of edges. Hence the treewidth bound described in Theorem 2 is
still applicable, and the theorem follows. (Again, if required one can obtain the actual edge cuts,
which can be used to build a MAF, in the same time bound by leveraging Theorem 4 of [IQ]). □

3.3 Hybridization Number 

In this section, we deal again with rooted trees, and thus we add a vertex labeled p to both trees 
to indicate the root location, as done for rSPR; see Figure [He). A rooted phylogenetic network 
(rooted network for short) $N = (V(N), E(N))$ on a set of taxa $X$ is any rooted acyclic digraph in
which no vertex has degree 2 (except possibly the root) and whose leaves are bijectively labeled
by elements of $X$. The hybridization number of $N$, denoted by $h(N)$, is defined as

$$h(N) = \sum_{v \in V(N) :\, \delta^-(v) > 0} (\delta^-(v) - 1) = |E(N)| - |V(N)| + 1,$$

where $\delta^-(v)$ denotes the indegree of $v$.
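The two expressions for $h(N)$ can be checked against each other on a small example. The following is a minimal sketch only: the edge-list representation, the function name and the toy network are assumptions made for illustration and do not come from the article.

```python
from collections import defaultdict


def hybridization_number(edges):
    """Return h(N) computed two ways: (indegree sum, |E| - |V| + 1).

    `edges` is a list of (tail, head) pairs of a rooted acyclic digraph
    with a single root and no duplicated edges.
    """
    indegree = defaultdict(int)
    vertices = set()
    for tail, head in edges:
        vertices.update((tail, head))
        indegree[head] += 1
    # Sum of (indegree - 1) over all vertices with positive indegree.
    by_indegree = sum(d - 1 for d in indegree.values() if d > 0)
    # Closed form from the definition above.
    by_counting = len(edges) - len(vertices) + 1
    return by_indegree, by_counting


# Hypothetical network on taxa a, b, c with one reticulation vertex "h".
edges = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "h"),
         ("v", "h"), ("v", "c"), ("h", "b")]
print(hybridization_number(edges))
```

For a rooted tree every non-root vertex has indegree 1, so both expressions evaluate to 0.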

Given a rooted network $N$ on $X$ and a rooted binary tree $T$ on $X'$, with $X' \subseteq X$, we say
that $T$ is displayed by $N$ if $T$ can be obtained from $N$ by deleting a subset of its edges and any
resulting degree 0 vertices, and then suppressing vertices with $\delta^-(v) = \delta^+(v) = 1$.

We are now ready to define the hybridization number problem: 

Problem: $HN(T_1, T_2)$

Input: Two rooted binary trees $T_1$, $T_2$ on the same set of taxa $X$.

Output: A rooted network $N$ displaying $T_1$ and $T_2$ such that $h(N)$ is minimum over all rooted
networks with this property.

The hybridization number for $T_1$ and $T_2$, denoted by $h(T_1, T_2)$, is defined as the hybridization
number of this minimum network. As done for TBR and rSPR, we can give a characterization
of the hybridization number in terms of agreement forests. To do so, we need to define acyclic
agreement forests.

Let $\mathcal{F} = \{F_1, F_2, \ldots, F_k\}$ be an agreement forest for two rooted binary trees $T_1$ and $T_2$ on the
same set of taxa $X$, and let $AG(T_1, T_2, \mathcal{F})$ be the directed graph whose vertex set is $\mathcal{F}$ and for
which $(F_i, F_j)$ is an arc iff $i \neq j$, and either

(1) the root of $T_1(\mathcal{L}(F_i))$ is an ancestor of the root of $T_1(\mathcal{L}(F_j))$ in $T_1$, or

(2) the root of $T_2(\mathcal{L}(F_i))$ is an ancestor of the root of $T_2(\mathcal{L}(F_j))$ in $T_2$.

We call $\mathcal{F}$ an acyclic agreement forest (AAF) for $T_1$ and $T_2$ if $AG(T_1, T_2, \mathcal{F})$ does not contain any
directed cycle. A maximum acyclic agreement forest (MAAF) is an AAF of minimum size.

The acyclicity condition is used to model the fact that species cannot inherit genetic material 
from their own offspring. The two problems defined above are closely related, as the following 
well-known result shows. 

Theorem 6 ([3]). Given two rooted binary trees $T_1$, $T_2$ on the same set of taxa $X$, we have that
$h(T_1, T_2) = |MAAF(T_1, T_2)| - 1$.

The above equivalence formed the basis for results proving that both problems are NP-hard [8]
and fixed parameter tractable [7].

Here we show an alternative proof that computation of hybridization number on two rooted
binary trees with the same set of taxa $X$ is FPT, again using MSOL. We will do this by
demonstrating that $k := |MAAF(T_1, T_2)|$ can be computed in time $O(f(k) \cdot |X|)$ for some computable
function $f$ that depends only on $k$. Again, we will formulate a logical query on the display graph
to answer the question “Is $k \le k'$?” for increasing values of $k'$, until $k' = k$ is reached and the
answer to the query is “yes”. Unlike the formulations given earlier for TBR and rSPR, the query
has no free variables, and the length of each query will grow as a function of $k'$. However, given
that $k' \le k$, the length will remain bounded by a function of $k$. Note that, if a MAAF of size $k$
exists for $T_1$ and $T_2$, then an AF of size $k$ exists too, and as argued for rSPR, if two rooted trees
have an agreement forest of size $k$ then so do the underlying, unrooted trees. So the treewidth
bound of Theorem 2 is still valid, where $k = |MAAF(T_1, T_2)|$, and this implies that an overall
running time of the form $O(f(k) \cdot |X|)$ can be achieved.

The major challenge when modelling MAAF is to encode the acyclicity constraints. It is not 
clear whether the formulations from the previous sections, in which agreement forests are modelled 
directly as sets of edge-cuts, can be (elegantly) extended to include acyclicity constraints. For 
this reason we choose to discard the agreement forest abstraction, using it only to generate the 
treewidth upper bound. For the actual modelling we use an alternative “elimination-ordering” 
characterization of MAAF/HN, first presented in [23], which we briefly summarize here. 

Given a rooted binary tree $T$ on $X$, we say a subtree $T'$ of $T$ is pendant if there exists a vertex
$u$ of $T$ such that $T' = T_u$. In this case it is then natural to associate $T'$ with the subset of $X$
labeling its leaves, i.e. $\mathcal{L}(T')$. We say that $T'$ is a common pendant subtree of $T_1$ and $T_2$ if it is
a pendant subtree of both $T_1$ and $T_2$. We call $(S_1, S_2, \ldots, S_p)$ ($p \ge 0$) a common pendant subtree
sequence of $T_1$ and $T_2$ of length $p$ if for every $1 \le i \le p$, $S_i$ is a common pendant subtree of
$T_1 - \bigcup_{j<i} \mathcal{L}(S_j)$ and $T_2 - \bigcup_{j<i} \mathcal{L}(S_j)$. We say that such a sequence is additionally a tree sequence
if the two trees $T_1 - \bigcup_{j \le p} \mathcal{L}(S_j)$ and $T_2 - \bigcup_{j \le p} \mathcal{L}(S_j)$ are identical. Informally, a tree sequence of
length $p$ describes a sequence of $p$ common pendant subtrees that can be successively pruned from
the original trees to reach a common core tree. If $T_1$ and $T_2$ are already identical then we use the
empty tree sequence $\emptyset$, and take $p = 0$, to represent this.
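The definition of a tree sequence can be made concrete with a direct, non-MSOL check. The sketch below is illustrative only and assumes a representation not used in the article: a rooted binary tree is a nested 2-tuple whose leaves are taxon names (non-empty strings), and all function names are hypothetical.

```python
def leaves(t):
    """Set of taxa labelling the leaves of t."""
    return frozenset([t]) if isinstance(t, str) else leaves(t[0]) | leaves(t[1])


def restrict(t, keep):
    """Prune taxa outside `keep`, suppressing degree-2 vertices; None if nothing is left."""
    if isinstance(t, str):
        return t if t in keep else None
    left, right = restrict(t[0], keep), restrict(t[1], keep)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def canon(t):
    """Child-order-independent form, so that equal topologies compare equal."""
    return t if isinstance(t, str) else frozenset((canon(t[0]), canon(t[1])))


def pendant(t, s):
    """Canonical form of the pendant subtree of t with leaf set s, or None."""
    if leaves(t) == s:
        return canon(t)
    if isinstance(t, str):
        return None
    found = pendant(t[0], s)
    return found if found is not None else pendant(t[1], s)


def is_tree_sequence(t1, t2, seq):
    """Check whether seq (a list of taxon sets) is a tree sequence for t1 and t2."""
    remaining = leaves(t1)
    for s in seq:
        s = frozenset(s)
        r1, r2 = restrict(t1, remaining), restrict(t2, remaining)
        if r1 is None or r2 is None:
            return False
        p1, p2 = pendant(r1, s), pendant(r2, s)
        if p1 is None or p1 != p2:   # s must be pendant in both, with equal topology
            return False
        remaining = remaining - s
    r1, r2 = restrict(t1, remaining), restrict(t2, remaining)
    c1 = canon(r1) if r1 is not None else None
    c2 = canon(r2) if r2 is not None else None
    return c1 == c2                  # the pruned trees must be identical


t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "c"), "b"), "d")
print(is_tree_sequence(t1, t2, [{"b"}]))   # pruning b leaves ((a,c),d) in both trees
```

In this toy instance the empty sequence fails because the trees differ, while pruning the single taxon b yields a common core tree.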

The results in [23] establish that $h(T_1, T_2)$ is equal to the smallest $p$ such that a tree sequence of
length $p$ exists. This is the characterization of optimality that we will use, i.e. each logical query will
pose the question, “Does a tree sequence of length $k'$ exist?”. There is no need to model acyclicity
in this formulation. However, we do need to model the concept of a common pendant subtree and the
impact of earlier pruning steps on the original trees.

Before writing down the MSOL formulation we need some new auxiliary predicates. The first
predicate checks whether there is a path from $x_1$ to $x_2$ within $Z$ that survives the deletion of
vertex $u$. This is similar to the PAC predicate defined earlier.

$$\mathit{pathSurvivesVertexCut}(Z, x_1, x_2, u) := (u \neq x_1) \wedge (u \neq x_2) \wedge \Big((x_1 = x_2) \vee \neg\exists P, Q \big(\mathit{Bipartition}(Z, P, Q) \wedge$$
$$x_1 \in P \wedge x_2 \in Q \wedge (\forall p, q\,(p \in P \wedge q \in Q \Rightarrow \neg \mathit{adj}(p, q) \vee p = u \vee q = u))\big)\Big)$$

For a vertex $u \neq \rho$ in a tree $T_i$ and a taxon $x \in X$, observe that $x$ is in the clade rooted at $u$
(i.e. in the label set of the pendant subtree rooted at $u$) if and only if ($x = u$) or deleting $u$ from
$T_i$ destroys all paths from $\rho$ to $x$ (inside $T_i$). Hence:

$$\mathit{InCladeUnder}^i(u, x) := (u = x) \vee \neg \mathit{pathSurvivesVertexCut}(V_i, \rho, x, u)$$

This leads naturally to a predicate for testing whether $C \subseteq X$ is a clade of $T_i$.

$$\mathit{Clade}^i(C) := \exists u \in V_i\, (\forall x\, (x \in C \Leftrightarrow \mathit{InCladeUnder}^i(u, x)))$$
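The observation behind $\mathit{InCladeUnder}$ can be tested directly with a graph search. This is a sketch under assumptions: the tree is given as an undirected adjacency dictionary containing the root placeholder (named "rho" here), and the helper names mirror the predicates but are otherwise hypothetical.

```python
from collections import deque


def path_survives_vertex_cut(adj, x1, x2, u):
    """True if some x1-x2 path avoids vertex u (and u is neither endpoint)."""
    if u == x1 or u == x2:
        return False
    seen, queue = {x1}, deque([x1])
    while queue:
        v = queue.popleft()
        if v == x2:
            return True
        for w in adj[v]:
            if w != u and w not in seen:
                seen.add(w)
                queue.append(w)
    return False


def in_clade_under(adj, rho, u, x):
    """x lies in the clade rooted at u iff x == u or deleting u cuts rho from x."""
    return u == x or not path_survives_vertex_cut(adj, rho, x, u)


def clade(adj, rho, u, taxa):
    """Label set of the pendant subtree rooted at u."""
    return {x for x in taxa if in_clade_under(adj, rho, u, x)}


# Toy tree ((a,b),c) with root r and the placeholder rho attached to r.
adj = {"rho": ["r"], "r": ["rho", "p", "c"], "p": ["r", "a", "b"],
       "a": ["p"], "b": ["p"], "c": ["r"]}
taxa = {"a", "b", "c"}
print(sorted(clade(adj, "rho", "p", taxa)))
```

A set $C$ is then a clade of the tree exactly when it equals `clade(adj, "rho", u, taxa)` for some vertex `u` other than the placeholder, which is what the existential quantifier in $\mathit{Clade}^i$ expresses.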

As we shall see, it is useful to extend this predicate with an optional list $Z_1, Z_2, \ldots$ which represent
subsets of $X$ describing common pendant subtrees that have already been pruned from the tree.
The statement $\mathit{Clade}^i(C, [Z_1, Z_2, \ldots])$ evaluates to true if and only if $C$ is a clade of $T_i$ after
the taxa in $Z_1, Z_2, \ldots$ have been pruned away. (To avoid ambiguity the predicate automatically
returns false if $C$ intersects with any of the $Z_j$.) Note that the list of $Z_j$ is shown in square brackets
to emphasize that it is a “macro”: there will be a different predicate for each possible list length
$t$. The list of $Z_j$ will never be longer than $h(T_1, T_2)$, and the length of the generated predicate will be
bounded by a function of the list length, so the length of the overall logical query remains bounded
by a function of $h(T_1, T_2)$.

$$\mathit{Clade}^i(C, [Z_1, \ldots, Z_t]) := \mathit{NoIntersect}(C, Z_1) \wedge \ldots \wedge \mathit{NoIntersect}(C, Z_t) \wedge \exists u \in V_i \Big(\forall x\,(x \in C \Rightarrow \mathit{InCladeUnder}^i(u, x)) \wedge$$
$$\forall x\,\big(\mathit{InCladeUnder}^i(u, x) \Rightarrow (x \in C) \vee (x \in Z_1) \vee \ldots \vee (x \in Z_t)\big)\Big)$$

We are now ready to define the CPS (i.e. “common pendant subtree”) predicate. We do this by
observing that $C \subseteq X$ corresponds to a common pendant subtree of $T_1$ and $T_2$ if and only if $C$ is
a clade of both trees (this ensures that $C$ is pendant in both trees) and the set of triplets induced
by $C$ is identical in both trees (this ensures that the pendant subtree has the same topology in
both trees).

$$\mathit{CPS}(T_1, T_2, C) := \mathit{Clade}^1(C) \wedge \mathit{Clade}^2(C) \wedge \forall x \forall y \forall z\, \big(x, y, z \in C \wedge \mathit{allDiff}(x, y, z) \Rightarrow (\mathit{Triplet}^1(x, y, z) \Leftrightarrow \mathit{Triplet}^2(x, y, z))\big)$$

We extend this now with a list of $Z_j$ representing the taxa we have already pruned. This new
version of the predicate evaluates to true if and only if $C$ corresponds to a common pendant
subtree in the two trees after all the $Z_j$ have been pruned away. (Here we make implicit use of the
fact that the Clade predicate immediately returns false whenever $C$ intersects with the $Z_j$.)

$$\mathit{CPS}(T_1, T_2, C, [Z_1, \ldots, Z_t]) := \mathit{Clade}^1(C, [Z_1, \ldots, Z_t]) \wedge \mathit{Clade}^2(C, [Z_1, \ldots, Z_t]) \wedge$$
$$\forall x \forall y \forall z\, \big(x, y, z \in C \wedge \mathit{allDiff}(x, y, z) \Rightarrow (\mathit{Triplet}^1(x, y, z) \Leftrightarrow \mathit{Triplet}^2(x, y, z))\big)$$




We are now ready to directly pose the question: is there a tree sequence of length $k'$? We can
assume $k' \ge 1$ because $k' = 0$ is trivial to check in polynomial time. To make the formulation
slightly more compact we actually construct a list of length $k' + 1$, where $C_{k'+1}$ represents the
taxa that still remain after the common pendant subtrees have been pruned away: we can then
test that the sequence is a tree sequence (i.e. that a common core tree remains) by testing that
$\mathit{CPS}(T_1, T_2, C_{k'+1}, [C_1, \ldots, C_{k'}])$ is true. Note that the HybNum predicate is again a macro, whose
expansion depends on $k'$.

$$\mathit{HybNum}[k'](T_1, T_2) := \exists C_1, \ldots, C_{k'}, C_{k'+1} \Big(\mathit{Partition}(X, C_1, \ldots, C_{k'+1}) \wedge$$
$$\mathit{CPS}(T_1, T_2, C_1, \emptyset) \wedge$$
$$\mathit{CPS}(T_1, T_2, C_2, [C_1]) \wedge$$
$$\mathit{CPS}(T_1, T_2, C_3, [C_1, C_2]) \wedge$$
$$\ldots$$
$$\mathit{CPS}(T_1, T_2, C_{k'+1}, [C_1, \ldots, C_{k'}])\Big).$$


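The Partition predicate has the expected meaning and definition:

$$\mathit{Partition}(X, C_1, \ldots, C_k) := \Big(\bigwedge_{i \neq j} \mathit{NoIntersect}(C_i, C_j)\Big) \wedge \forall u\, \big(u \in X \Leftrightarrow (u \in C_1 \vee u \in C_2 \vee \ldots \vee u \in C_k)\big)$$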
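Because HybNum is a macro, a separate query has to be generated for each value of $k'$. The following sketch shows one way such an expansion could be produced as text; the output syntax (`exists`, `and`, the names `T1`, `T2`, `C1`, ...) is an assumption for illustration and is not the input format of any particular MSOL solver. The nested Clade and Partition macros are left unexpanded.

```python
def cps(current, pruned):
    """Text of one CPS conjunct; `pruned` lists the sets removed earlier."""
    if not pruned:
        return f"CPS(T1, T2, {current})"
    return f"CPS(T1, T2, {current}, [{', '.join(pruned)}])"


def hyb_num_query(k_prime):
    """Expand the HybNum[k'] macro into a single query string."""
    if k_prime < 1:
        raise ValueError("k' = 0 is handled separately, so k' >= 1 is assumed")
    sets = [f"C{i}" for i in range(1, k_prime + 2)]          # C1 .. C(k'+1)
    conjuncts = [f"Partition(X, {', '.join(sets)})"]
    # The i-th set must be a common pendant subtree once C1 .. C(i-1) are pruned.
    conjuncts += [cps(c, sets[:i]) for i, c in enumerate(sets)]
    return f"exists {', '.join(sets)} ({' and '.join(conjuncts)})"


print(hyb_num_query(2))
```

The number of CPS conjuncts is $k' + 1$ and the $i$-th one mentions $i - 1$ earlier sets, so the unexpanded text grows quadratically in $k'$ and depends on nothing else, which is the property the running time argument relies on.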

Concluding, we have the following result: 

Theorem 7. Computation of hybridization number / MAAF on two rooted binary trees on the
same set of taxa $X$ is linear time FPT. That is, the optimum $k$ can be computed in time
$O(f(k) \cdot |X|)$, for some computable function $f$ that depends only on $k$.

3.4 Parsimony distance on binary characters 

Let $T$ be an unrooted binary tree on a set of taxa $X$. A binary character $f$ is simply a function
$f : X \to \{\text{red}, \text{blue}\}$. An extension of $f$ to $T$ is a mapping $g : V(T) \to \{\text{red}, \text{blue}\}$ such that,
for all $x \in X$, $g(x) = f(x)$. For a given character $f$, an optimal extension is any extension $g$ of
$f$ such that the number of bichromatic edges is minimized. The number of bichromatic edges in
an optimal extension is called the parsimony score of $f$ with respect to $T$, and denoted $l_f(T)$.
The well-known algorithm by Fitch can be used to compute $l_f(T)$ (and an optimal extension) in
polynomial time m- We shall describe Fitch’s algorithm in due course. The parsimony distance 
problem on binary characters, denoted $d^2_{MP}$, is defined as follows [17].

Problem: $d^2_{MP}(T_1, T_2)$

Input: Two unrooted binary trees $T_1$, $T_2$ on the same set of taxa $X$.

Output: Construct a binary character $f$ on $X$ such that the value $|l_f(T_1) - l_f(T_2)|$ is maximized.

We use $d^2_{MP}$ to denote the optimum value of $|l_f(T_1) - l_f(T_2)|$. The problem was recently shown
to be NP-hard and APX-hard [25]. It is not known whether the problem is FPT in $d^2_{MP}$. The
following result, however, is already known.

Lemma 1 (D3I)- Let T\,T 2 be two unrooted binary trees on the same set of taxa X. Then 
$d^2_{MP}(T_1, T_2) \le d_{TBR}(T_1, T_2)$.

Given two trees $T_1, T_2$ as input to $d^2_{MP}$, it is not known whether the display graph $D$ of $T_1$
and $T_2$ has treewidth bounded by a function of $d^2_{MP}$. However, from Lemma 1 and earlier results
in this article (Theorems 1 and 2) it is clear that $D$ has treewidth bounded by a function of
$d_{TBR}(T_1, T_2)$. An MSOL formulation modelling $d^2_{MP}$, whose length is bounded by a function of
$d^2_{MP}$, will therefore give a running time of the form $f(d_{TBR}(T_1, T_2)) \cdot O(|X|)$ for some computable
function $f$ that only depends on $d_{TBR}(T_1, T_2)$. We now give such a formulation. We will remain
within the framework of [2], this time using the (“linear extremum”) optimization variant of
MSOL. This allows us to maximize or minimize an affine function of (the cardinalities of) the free
set variables in the query.

The MSOL formulation we give here, which is based on an ILP formulation from [22], maximizes
$l_f(T_1) - l_f(T_2)$. (To compute $d^2_{MP}$ we need to use the MSOL machinery twice, once for
$l_f(T_1) - l_f(T_2)$ and once for $l_f(T_2) - l_f(T_1)$, taking the maximum of the two results. The second call only
differs in its objective function so we omit details.)

The basic idea is to range over all possible binary characters, simultaneously embedding two
static formulations^4 of Fitch’s algorithm to “compute” $l_f(T_1) - l_f(T_2)$.

Fitch’s algorithm proceeds as follows. If $T$ is not rooted, we root it arbitrarily (by subdividing
an arbitrary edge). The algorithm then works in two phases, a bottom-up phase which computes
$l_f(T)$, and then a top-down phase which actually computes a corresponding extension. In the
bottom-up phase, we start by assigning each taxon $x$ the singleton set of colours $S(x) := \{f(x)\}$.
For an internal node $u$ with children $v_1, v_2$ we set $S(u) := S(v_1) \cap S(v_2)$ (if $S(v_1) \cap S(v_2) \neq \emptyset$,
in which case we say $u$ is an intersection node) and $S(u) := S(v_1) \cup S(v_2)$ (if $S(v_1) \cap S(v_2) = \emptyset$,
in which case we say that $u$ is a union node). The value $l_f(T)$ is equal to the number of internal
nodes that are union nodes. (We omit a description of the constructive top-down phase as it is
not relevant for this article.)
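The bottom-up phase just described is short enough to write out. The sketch below assumes, purely for illustration, that an already rooted binary tree is given as a nested 2-tuple with taxon names at the leaves and that the character is a dictionary from taxa to "red" or "blue"; neither convention comes from the article.

```python
def fitch_score(tree, f):
    """Bottom-up phase of Fitch's algorithm.

    Returns (S(root), number of union nodes); the second value is l_f(T).
    """
    if isinstance(tree, str):                 # a taxon: singleton colour set
        return {f[tree]}, 0
    s1, n1 = fitch_score(tree[0], f)
    s2, n2 = fitch_score(tree[1], f)
    common = s1 & s2
    if common:                                # intersection node
        return common, n1 + n2
    return s1 | s2, n1 + n2 + 1               # union node adds one to the score


f = {"a": "red", "b": "blue", "c": "red", "d": "blue"}
t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(fitch_score(t1, f)[1], fitch_score(t2, f)[1])
```

On this toy character the two quartet trees have parsimony scores 2 and 1, so the character witnesses a difference of 1 between them.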

To translate this into an MSOL formulation, we begin by arbitrarily rooting $T_1$ and $T_2$ and
using $\rho$ as the placeholder for the root, in the usual fashion. The central idea is to partition the
vertices of each tree $T_i$ into four possible subsets $R^i$, $B^i$, $RB^i_\cap$ and $RB^i_\cup$ corresponding to the set
of colours that Fitch allocates to each node, and distinguishing union events from intersection
events: red, blue, {red, blue} (intersection node) and {red, blue} (union node). We therefore ask
the MSOL formulation to instantiate the free set variables $R^i$, $B^i$, $RB^i_\cap$ and $RB^i_\cup$ ($i \in \{1, 2\}$) such
that the expression $|RB^1_\cup| - |RB^2_\cup|$ is maximized. (If desired, this can then be made constructive
via Theorem 4 of [10].) The only significant work is simulating the bottom-up execution of Fitch’s
algorithm, in particular encoding expressions which describe the state of a parent node $u$ in terms
of its two children $v_1, v_2$.

We introduce the auxiliary predicate $\mathit{child}^i(u, v)$ which says that $v$ is a child of $u$ in $T_i$. We
can model this as follows: $v$ is a child of $u$ in $T_i$ if and only if there is an edge $e$ in $T_i$ such that $v$
and $u$ are both endpoints of $e$ and there does not exist a path from $\rho$ to $v$ that survives the edge
cut $e$. (Here we have specialized the PAC predicate from earlier so that it only takes a single edge,
rather than a set of edges, as its fourth argument.)

$$\mathit{child}^i(u, v) := (u \neq v) \wedge \exists e \in E_i\, \big(R_D(e, u) \wedge R_D(e, v) \wedge \neg \mathit{PAC}(V_i, \rho, v, e)\big)$$

For each tree Tj we add the following constraints, which encode (in this order): 


— The four subsets $R^i$, $B^i$, $RB^i_\cap$ and $RB^i_\cup$ partition the vertices of the tree;

— A vertex in $X$ can only be in $R^i$ or $B^i$;

— An internal node is in $R^i$ if and only if (one child is in $R^i$ and the other child is not in $B^i$);

— An internal node is in $B^i$ if and only if (one child is in $B^i$ and the other child is not in $R^i$);

— An internal node is in $RB^i_\cap$ if and only if (neither child is in $R^i$ or $B^i$);

— An internal node is in $RB^i_\cup$ if and only if (one child is in $R^i$ and one child is in $B^i$).


4 Interestingly, the earlier phylogenetics MSOL articles [10,25] also used static formulations: in that case
the classical polynomial-time algorithm of Aho.





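$$\Big[\mathit{Partition}(V_i, R^i, B^i, RB^i_\cap, RB^i_\cup)\Big] \wedge \Big[\forall x \in X\, (x \notin RB^i_\cap \wedge x \notin RB^i_\cup)\Big] \wedge$$
$$\Big[\forall u \in V_i\, \big(u \notin X \Rightarrow (u \in R^i \Leftrightarrow \exists c_1, c_2 \in V_i\, ((c_1 \neq c_2) \wedge \mathit{child}^i(u, c_1) \wedge \mathit{child}^i(u, c_2) \wedge c_1 \in R^i \wedge c_2 \notin B^i))\big)\Big] \wedge$$
$$\Big[\forall u \in V_i\, \big(u \notin X \Rightarrow (u \in B^i \Leftrightarrow \exists c_1, c_2 \in V_i\, ((c_1 \neq c_2) \wedge \mathit{child}^i(u, c_1) \wedge \mathit{child}^i(u, c_2) \wedge c_1 \in B^i \wedge c_2 \notin R^i))\big)\Big] \wedge$$
$$\Big[\forall u \in V_i\, \big(u \notin X \Rightarrow (u \in RB^i_\cap \Leftrightarrow \exists c_1, c_2 \in V_i\, ((c_1 \neq c_2) \wedge \mathit{child}^i(u, c_1) \wedge \mathit{child}^i(u, c_2) \wedge c_1 \notin R^i \wedge c_1 \notin B^i \wedge c_2 \notin R^i \wedge c_2 \notin B^i))\big)\Big] \wedge$$
$$\Big[\forall u \in V_i\, \big(u \notin X \Rightarrow (u \in RB^i_\cup \Leftrightarrow \exists c_1, c_2 \in V_i\, ((c_1 \neq c_2) \wedge \mathit{child}^i(u, c_1) \wedge \mathit{child}^i(u, c_2) \wedge c_1 \in R^i \wedge c_2 \in B^i))\big)\Big]$$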

Finally, we ensure that both trees select the same character as follows: 

$$\forall x \in X\, \big((x \in R^1 \Leftrightarrow x \in R^2) \wedge (x \in B^1 \Leftrightarrow x \in B^2)\big)$$


This concludes the formulation. Then we have the following result: 

Theorem 8. $d^2_{MP}(T_1, T_2)$ is linear time fixed parameter tractable in parameter $d_{TBR}(T_1, T_2)$.
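For very small instances the optimization that the MSOL query performs can be imitated by exhaustive search, which is useful as a sanity check on the four-way vertex classification. This is a brute-force sketch and not the linear time procedure of the theorem; the nested-tuple tree representation, the class labels and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def classify(tree, f, counts):
    """Bottom-up Fitch pass that tallies every vertex as R, B, RB_cap or RB_cup."""
    if isinstance(tree, str):
        state = frozenset([f[tree]])
        kind = "R" if f[tree] == "red" else "B"
    else:
        s1 = classify(tree[0], f, counts)
        s2 = classify(tree[1], f, counts)
        if s1 & s2:                                   # intersection node
            state = s1 & s2
            kind = {frozenset(["red"]): "R",
                    frozenset(["blue"]): "B"}.get(state, "RB_cap")
        else:                                         # union node
            state, kind = s1 | s2, "RB_cup"
    counts[kind] += 1
    return state


def max_difference(t1, t2, taxa):
    """Maximum of |RB_cup^1| - |RB_cup^2| over all binary characters on taxa."""
    best = None
    for colours in product(("red", "blue"), repeat=len(taxa)):
        f = dict(zip(taxa, colours))                  # same character in both trees
        c1, c2 = Counter(), Counter()
        classify(t1, f, c1)
        classify(t2, f, c2)
        diff = c1["RB_cup"] - c2["RB_cup"]
        best = diff if best is None else max(best, diff)
    return best


taxa = ["a", "b", "c", "d"]
t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
# Two calls with swapped objective, as described for the MSOL machinery.
d_mp = max(max_difference(t1, t2, taxa), max_difference(t2, t1, taxa))
print(d_mp)
```

The search enumerates all $2^{|X|}$ characters, so it is only practical for a handful of taxa; its purpose is to make the objective $|RB^1_\cup| - |RB^2_\cup|$ and the shared-character constraint concrete.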


4 Conclusion 

We have demonstrated how agreement forests, which are intensively studied objects in the
phylogenetics literature, naturally lead to bounded treewidth in an auxiliary graph structure known
as the display graph. This opens the door to compact, “declarative” proofs of fixed parameter
tractability for a range of phylogenetics problems by formulating them in Monadic Second Order
Logic (MSOL). Our formulations have introduced a number of logical predicates and design
principles that will hopefully be of use to other phylogenetics researchers seeking to utilize this powerful
machinery elsewhere in phylogenetics. Indeed, a natural follow-up question is to ask: what are the
essential characteristics of phylogenetics problems that are amenable to this technique?